Setting Up


In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import pandas as pd

In [19]:
pd.set_option('display.max_colwidth', 1000)

In [2]:
DATA_DIR = '../data/'
SEED = 12

Clean and Prep Wiki Data


In [3]:
import pandas as pd

In [4]:
toxicity_annotated_comments = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotated_comments.tsv'), sep = '\t')
toxicity_annotations = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotations.tsv'), sep = '\t')

In [5]:
# Average the per-annotator toxicity scores for each revision, then join the comment text.
annotations_gped = toxicity_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})
all_data = pd.merge(annotations_gped, toxicity_annotated_comments, on = 'rev_id')

In [6]:
# Replace the dataset's newline and tab placeholder tokens with spaces.
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

# TODO(nthain): Consider doing regression instead of classification
all_data['is_toxic'] = all_data['toxicity'] > 0.5
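
As a sanity check on the steps above, here is a minimal sketch on made-up data. The `toy_*` frames and their values are hypothetical stand-ins for the two TSV files, and `.str.replace(..., regex=False)` is swapped in for the `apply`/`lambda` calls; it should behave the same for string comments.

```python
import pandas as pd

# Toy stand-ins for the two TSV files (values are made up for illustration).
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2],
    'toxicity': [1, 1, 0, 0, 0],
})
toy_comments = pd.DataFrame({
    'rev_id': [1, 2],
    'comment': ['you areNEWLINE_TOKENawful', 'thanksTAB_TOKENfor the fix'],
})

# Average the annotator votes per revision, then attach the comment text.
toy_gped = toy_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})
toy_data = pd.merge(toy_gped, toy_comments, on='rev_id')

# Replace the placeholder tokens with spaces (literal match, not a regex).
toy_data['comment'] = (toy_data['comment']
                       .str.replace('NEWLINE_TOKEN', ' ', regex=False)
                       .str.replace('TAB_TOKEN', ' ', regex=False))

# A comment is labeled toxic when its mean annotator score exceeds 0.5.
toy_data['is_toxic'] = toy_data['toxicity'] > 0.5
print(toy_data)
```

Revision 1 has a mean score of 2/3 and is labeled toxic; revision 2 has a mean of 0 and is not. Note that the comparison is strict, so a comment with a mean of exactly 0.5 is labeled non-toxic.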

In [7]:
# Split into train, test, and dev sets using the dataset's own 'split' column.
wiki_splits = {}
for split in ['train', 'test', 'dev']:
    wiki_splits[split] = all_data.query('split == @split')
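
The `query` call above uses the same name for a column and a loop variable, which can be confusing. A small sketch on a hypothetical `toy` frame shows how the two are resolved:

```python
import pandas as pd

# Made-up frame with the same 'split' column convention as the wiki data.
toy = pd.DataFrame({
    'comment': ['a', 'b', 'c', 'd'],
    'split': ['train', 'train', 'dev', 'test'],
})

toy_splits = {}
for split in ['train', 'test', 'dev']:
    # The bare name `split` refers to the column; `@split` refers to the
    # Python loop variable of the same name.
    toy_splits[split] = toy.query('split == @split')

sizes = {name: len(df) for name, df in toy_splits.items()}
print(sizes)
```

Each entry of `toy_splits` keeps the original row index, so the frames can still be traced back to rows of `toy`.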

In [8]:
#for split in wiki_splits:
#    wiki_splits[split].to_csv(os.path.join(DATA_DIR, 'wiki_%s.csv' % split), index=False)

Prep debiasing data


In [9]:
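def augment_with_data(source_df, target_path, target_name, sep = '\t', write = False):
    """Append rows from the file at target_path to each split in source_df, then shuffle."""
    # source_df is a dict mapping split name -> DataFrame.
    target_df = pd.read_csv(target_path, sep = sep)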
    target_df['sample'] = target_name
    target_splits = {}
    for split in source_df:
        target_splits[split] = pd.concat([source_df[split],
                                          target_df.query('split == @split')]).sample(frac = 1, random_state = SEED)
        if write:
            target_splits[split].to_csv(os.path.join(DATA_DIR, 'wiki_%s_%s.csv' % (target_name, split)), index=False)
    return target_splits
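
The core of `augment_with_data` is a concatenate-then-shuffle step. The sketch below reproduces it in memory, without reading or writing files; the `toy_*` frames are hypothetical, and it assumes the source splits already carry a `sample` column.

```python
import pandas as pd

SEED = 12

# Hypothetical source splits, keyed by split name like wiki_splits.
toy_source = {
    'train': pd.DataFrame({'comment': ['a', 'b', 'c'], 'split': 'train', 'sample': 'random'}),
    'test': pd.DataFrame({'comment': ['d'], 'split': 'test', 'sample': 'random'}),
}

# Hypothetical extra data, tagged with its own sample name.
toy_target = pd.DataFrame({
    'comment': ['x', 'y', 'z'],
    'split': ['train', 'train', 'test'],
})
toy_target['sample'] = 'debias'

toy_augmented = {}
for split in toy_source:
    extra = toy_target.query('split == @split')
    # sample(frac=1) returns every row in shuffled order; a fixed
    # random_state makes the shuffle reproducible.
    toy_augmented[split] = pd.concat([toy_source[split], extra]).sample(frac=1, random_state=SEED)

print({name: df.shape for name, df in toy_augmented.items()})
```

Because `pd.concat` keeps each frame's original index, the result can contain duplicate index labels; passing `ignore_index=True` to `pd.concat` would avoid that if unique labels matter downstream.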

In [10]:
debias_splits = augment_with_data(wiki_splits, os.path.join(DATA_DIR, 'toxicity_debiasing_data.tsv'), 'debias')

In [11]:
wiki_splits['train'].shape


Out[11]:
(95692, 9)

In [12]:
debias_splits['train'].shape


Out[12]:
(99157, 9)

Prep random data


In [13]:
random_splits = augment_with_data(wiki_splits, os.path.join(DATA_DIR, 'toxicity_debiasing_data_random.tsv'), 'debias_random')

In [14]:
random_splits['train'].shape


Out[14]:
(99157, 9)

In [21]:
random_splits['train'].query('sample == "random"')


Out[21]:
comment is_toxic logged_in ns rev_id sample split toxicity year
25577 ` == The Halo's RfA == {| |- | |valign=top|Mark Dingemanse...Thank you very much for the constructive criticism in your oppose comment in my request for adminship. Ultimately, no consensus was reached, and I failed to be promoted, but I am very grateful for your coments. I will strive to better myself in all areas, especially Mainspace. |} ` False True user 7.7269e+07 random train 0.0 2006.0
150589 : You've changed your point several times. Sorry if I missed it. False True article 6.42874e+08 random train 0.0 2015.0
144969 ` *Neither admins nor non-admins should count votes, as you and I both already know. No actual case for either having primary topic was made. ` False True user 6.1052e+08 random train 0.0 2014.0
41923 It's interesting that this IP feels they know me so well, when I have had limited interaction with you. We have never actively engaged each other on talk pages. It's amazing you somehow have been able to edit every page I edit; every photograph you crop is mine. Have you wondered where this ire against me and my work has come from? You continually put a photo of a man in a wig (a photoshopped wig, at that) on the Afro page, and four different editors have removed it, five if you include me. This is what you consider ownership and edit warring? Then you have taken my photographs and decided to rename them simply to remove David Shankbone out of the file name. Your intentions are no pure. You are gaming the policies and guidelines to Wikistalk me. You say not to bite the newcomers, but I haven't actively engaged you. If you are a newcomer, how come you have such a handy knowledge of Wiki policies and guidelines? Why is every page in your history one I have contributed to... False True user 1.33966e+08 random train 0.0 2007.0
134484 :::::I guess it comes down to substantial and my experience with the Danish version of it being being a bit more loosely defined (such as there is no actual demand of the ball being a tennis ball or who throws the ball). False True article 5.49524e+08 random train 0.0 2013.0
59535 ` Please stop. If you continue to vandalize Wikipedia, you will be blocked from editing. ` False True user 1.95916e+08 random train 0.2 2008.0
39227 ` :I removed them. ` False True article 1.23612e+08 random train 0.0 2007.0
104536 ` == Stray punctuation == FYI, I believe this is now fixed per your query at . Thanks! ` False True user 3.81579e+08 random train 0.0 2010.0
83332 REDIRECT Talk:List of diplomatic missions of Switzerland False True article 2.8586e+08 random train 0.0 2009.0
3429 More Dutch speakers == I would say about 23 million people speak Dutch rahter than 20 million: * Netherlands 16 million * Belgium 6 million * Suriname, Antilles, other communities 1 million * total 23 million [anonymous] I don't know where you get those numbers, where comes the 1 million number? population: * Antilles 212,226 * Aruba 103,000 * Suriname 438,144 the population of these three regions doesnt reach one million, and Dutch is definetly not a national language in these places. Plus, the Netherlands and Belgium have a lot of immigrants. So I belive more in the 20 million figure, rather than the 23 million one.- 5 July 2005 11:27 (UTC) :There are for example probably about 12 million native speakers in the Netherlands, once you sutract immigrants, Frisian, Limburgish, and other Germanic language speakers. Suriname definitely isn't natively Dutch speaking; some of the languages ... False True article 1.82048e+07 random train 0.0 2005.0
81674 what the hell i think im a vampire hence the name and im a schoolgirl so dont diss. Btw i am cleaning this up i read all the books and am obsessed. TwilightVampire4Ever False True article 2.78791e+08 random train 0.3 2009.0
117111 :I created this account because my other one got blocked. There's no rule against that. Yes there is; it's called block evasion. As far as me having something against you personally, I've never heard of you before or seen anything of you before you put up the unblock request; as a checkuser, my tools include the ability to check to see if people are, as you were, abusing multiple accounts. False True user 4.45742e+08 random train 0.1 2011.0
3836 *Willmcw: Okay, but I don't undersatnd what you mean about mirrors and links? Could you elaborate, just for my edification? Also, what would you have me do? You are deleting my personal name; so is it okay, if this happens, if I add at least an external link to my site, or do you have some vendetta against me? I think you have the wrong impression of me. I would like to make a fresh start with you and be on good terms (and no, I don't expect you to change your vote to delete my entry). I am willing to try to make it clear that I am sincere and in good faith. Please let me know what I can do to demonstrate that to your satisfaction. Yesterday I indeed started a few entries, and am enjoying this. I want to do it right, but am an amateur at some of the rules and policies and editing techniques. Sincerely, Stephan Kinsella False True user 1.92391e+07 random train 0.0 2005.0
98293 ` == Technical Problems == There is no mention in the article about the Yellow Light of Death (YLOD) which indicates a general hardware failure on the PS3. There is ample discussion of this matter and a wide body of citations available; Currently 9M returns for ``yellow light of death`` http://search.live.com/results.aspx?q=yellow+light+of+death&form;=QBLH&filt;=all and 1.9M returns for ``yellow light of death`` http://www.google.ca/search?hl=en&q;=yellow+light+of+death&meta;=&aq;=f&oq;= There is also zero mention the bricking problems Sony has while updating PS3 firmware. There is ample discussion of this matter and a wide body of citations available, this is but one; http://gizmodo.com/5021399/playstation-3-firmware-24-bricking-some-ps3s Something should be mentioned in a new section, titled ``Technical Problems``. :Addition of ``YLOD`` has been discussed here in the past and the main reason it hasn't been included is that there weren't any reliable, notable so... False True article 3.52411e+08 random train 0.1 2010.0
112105 :* I've tried to reword and reduce this bit False True article 4.19913e+08 random train 0.0 2011.0
112882 Okay, thank you, you just clearly violated the civility rule by calling me stupid, that's just a bonus. False True user 4.23551e+08 random train 0.3 2011.0
93403 :In context that he was a founder-member, 'played with' is fine to mean he was still a member: if wished you could clarify by saying 'he continued to play with them...' but it would be a bit long-winded. The second 'Melos' bluelink is entirely optional: the article isn't long enough to make it really necessary, but repeat bluelinks from the intro paragraph into the main text are not uncommon and do sometimes help. I can't decide if 'premiere' for the Britten works should be singular or plural. A footnote source for the appointments at Michigan would help, as this is told 'on trust'.If there is more to add, it would be nice to draw the sentences together a little more into two or three paragraphs. But these are minutiae, quite at your discretion....! - all is well. False True user 3.29427e+08 random train 0.0 2009.0
102209 == Trade to Wizards == On NBA draft night the bulls traded Kirk Hinrich to the wizards for the 17th pick in the nba draft and a future 2nd round pick. see story here http://voices.washingtonpost.com/wizardsinsider/2010/06/report-wizards-trade-for-kirk.html False True article 3.71123e+08 random train 0.0 2010.0
157610 Ca'Foscari University - History of English Culture Project Hi everyone! I'm a postgraduate student from Ca' Foscari University, Venice. As an assignement for my History of English Culture course I have chosen to expand this article. As a provisional bibliography I have selected these sources: Tim Hitchcock and Robert Shoemaker, Tales from the Hanging Court, London, Hodder Arnold, 2007 www.oldbaileyonline.org If you have any suggestions or advice, I would really appreciate it. Feel free to talk to me here or on my talk page! False True article 6.86479e+08 random train 0.0 2015.0
107412 ::: If we're not permitted to discuss the subject matter, then I will cease. It is disappointing though that there is an alternative interpretation of the halting problem proof which isn't represented in the article. False False article 3.95864e+08 random train 0.0 2010.0
142178 ==DYK nomination of The FP== Hello! Your submission of The FP at the Did You Know nominations page has been reviewed, and some issues with it may need to be clarified. Please review the comment(s) underneath your nomination's entry and respond there as soon as possible. Thank you for contributing to Did You Know! v/r - False True user 5.95401e+08 random train 0.0 2014.0
127841 Your city looks very pretty from the wiki images on the main article. I don't how far I come, if so i'll just place on hold till the 10th False True user 5.05194e+08 random train 0.0 2012.0
113152 ` :The article just says he's a person that has worked for a number of companies. –– ` False True user 4.24933e+08 random train 0.0 2011.0
134282 ` ==Your warning== You might want to redirect that warning to the person who created the page and added the content ;) ` False True user 5.48051e+08 random train 0.0 2013.0
129454 ` : Dues to SUL conflict! ` False True user 5.15392e+08 random train 0.0 2012.0
140837 ` :That sounds like Cohen v. California.   ` False True article 5.87635e+08 random train 0.0 2013.0
37836 ` ==A view of a Japanese columnist== A conservative columnist, Hideaki Kase writes in NewsWeek:The fact is that the brothels were commercial establishments. U.S. Army records explicitly declare that the comfort women were prostitutes, and found no instances of ``kidnapping`` by the Japanese authorities. It's also worth noting that some 40 percent of these women were of Japanese origin.` False True article 1.18503e+08 random train 0.3 2007.0
117143 ` ==Sockpuppetry case== {| align=``left`` style=``background: transparent;`` || |} Your name has been mentioned in connection with a sockpuppetry case. Please refer to Wikipedia:Sockpuppet investigations/Jaimatadi000 for evidence. Please make sure you make yourself familiar with the guide to responding to cases before editing the evidence page. ` False True user 4.45986e+08 random train 0.1 2011.0
34091 ` :Please go right now and read Wikipedia policy. You must treat other users with respect and not address them in such a confrontational tone. Do not insinuate that they are morons for taking certain positions on this issue. :Now, to your point. No one is claiming that the image of Muhammad is any more ``real`` than the image of Qin Shi Huang. They are both done in later times to represent an historical figure. Muslims do have a tradition of representing the prophet in images even if it is not the most prevalent. What the image on the page represents is part of the tradition that Muslims have in representing Muhammad. Now, feel free to enter into discussion about whether or not you believe such images are important enough to warrant inclusion on Muhammad but make sure you are arguing about depictions in Muslim tradition and not merely your own personal feelings. Also note that Depictions of Muhammad (AVOID if you will be offended) has various images of MuhammadMuslim and n... False True article 1.05739e+08 random train 0.1 2007.0
62361 == NO == NO SOUP FOR YOU sorry - couldn't resist! False True user 2.04994e+08 random train 0.1 2008.0
... ... ... ... ... ... ... ... ... ...
103637 :::Thank you for that reference, I think it's assisted me greatly. Some of the things that it suggests should be there are obviously not applicable (there's no plot!) but overall I think it's given me a lot of help with the tone. I've now re-written the lead keeping these principles and your comments in mind and I think it's an improvement, what do you think? False True article 3.77424e+08 random train 0.0 2010.0
138908 ::Hi, adding my thoughts here. While you are welcome to your opinion of Manning's gender, it is not germane to the title discussion, and the guidelines we developed in advance of the discussion suggested that we keep such opinions to ourselves. That said, I would ask those writing here to consider dropping similar notes on the pages of those who share their opinion that Manning is a woman, as that is not a valid argument for a title change either. False True user 5.75297e+08 random train 0.1 2013.0
66939 ` :::let me point out first that you're both falling into a very common pattern here. I've seen it a million times, in and out of Wikipedia: when a dispute is right on the verge of being resolved, frustrations and grumpinesses start to rise - not from the dispute per se, but from the kind of mild insulted feeling that's always a part of having a dispute. don't let it get ya. -) :::DGT, the ISBN is 1-58705-040-4. :::don't worry about a PDF version - Jeremy wasn't responding to the discussion that we were having directly, but just speaking in general. I think if we can all hang in with the Good Faith beliefs for just a little bit more while Lee de-commercializes the HTML version, we can let this discussion go and move on to bigger and better things. ` False True article 2.20873e+08 random train 0.0 2008.0
148964 ` ::::: :::::That is fine with me. AFAICT, the tool silently ignores references it cannot parse, so a simpler alternative would be to list those references along with the ones that resulted in HTTP errors. ::::: ` False False user 6.32827e+08 random train 0.0 2014.0
112024 ` ``Cattle class`` is another (possibly regional, definitely slang) nickname for economy class. Might or might not be worth including. And check the link for Via Rail, it is a Canadian company, not American. There isn't any routes in the US either.` False False article 4.19529e+08 random train 0.0 2011.0
15804 ` :You will need to provide link diffs in the evidence section. Миборовский |||M|E|! ` False True user 5.26015e+07 random train 0.0 2006.0
83283 == Easter Egg? == decided to leave it in and make it an easter egg.[citation needed] AFAIR, it's not an easter egg, as you simply GO INTO the next level, when you've won the first level, right ? False True article 2.85591e+08 random train 0.0 2009.0
39246 Paradox? Don't you mean conundrum. a paradox is something I thought to be impossible!! And this is not. False False article 1.23676e+08 random train 0.0 2007.0
44248 WikiProject Cold War history}} {{ False True article 1.42515e+08 random train 0.0 2007.0
56128 ` You'll do Whatever it takes to change the subject of why you and kos should be banned. ``D*ck`` is vandalism. It's just the politics here are fucking awful. Go ahead. Change the subject again. Whatever it takes to change the subject. Comedy. KOS was trying to justify the ban, in a debate, but in the middle I got banned. He couldnt come up with a reply so he banned me.. Comedy. Then he tried again, but I replied and got banned. Go back to suckin his dick k.` True False user 1.8565e+08 random train 0.9 2008.0
43774 ` ::::Oh, I thought that this might have come into play: Others' comments It is not necessary to bring talk pages to publishing standards, so there is no need to correct typing errors, grammar, etc. It tends to irritate the users whose comments you are correcting. Never edit someone's words to change their meaning. Editing others' comments is not allowed. There are exceptions, however. Some are: * If you have their permission * Removing prohibited material such as libel and personal details * Removing personal attacks and incivility. This is controversial, and many editors do not feel it is acceptable; please read WP:ATTACK#Removal of text and WP:CIVIL#Removing uncivil comments before removing anything. * Unsigned comments: You are allowed to append or one of its variants to the end of someone's comment if they have failed to sign it. The form is {{subst:unsigned|USER NAME OR IP}}, which results in —The preceding unsigned comment was added by • .. * Interruptions: In some... False True user 1.4068e+08 random train 0.0 2007.0
8108 == Hello == Thanks for you greeting message. I became intereted in wiki because I wanted to edit my Oxford college page - thought it was too drab and I cant bear it any longer. I am sure you will be a big help to a new comer like me. Thanks a bunch in advance! - Mike False True user 3.24775e+07 random train 0.0 2005.0
124001 ::Does that mean that within the Occupied Territories, 502 of the 515 criminal suits that year related to right wing Jewish settlers, or within Israel at large, 502 of the 515 criminal suits that year related to right wing Jewish settlers in the Occupied Territories?Best Wishes False True article 4.83901e+08 random train 0.0 2012.0
25626 Superfluous - what's a University for if not academic achievments, football? False False article 7.73992e+07 random train 0.0 2006.0
80116 :This is about you, not about other users. You are very, very fast to apply tu quoque fallacies when you're called on your behaviour. I have advised several editors previously about the recommendations of the above-mentioned essay, and will advise others if I see them engaging in similar behaviour. Please consider the recommendation of Matthew 7:5. False True user 2.72511e+08 random train 0.1 2009.0
47966 ` == Lead section == I believe that the following sentence: :``When current is applied to the capacitor, electric charges of equal magnitude, but opposite polarity, build up on each plate.`` is more accurate than this version: :``When charge is made to flow into the capacitor, electric charges of equal magnitude, but opposite polarity, build up on each plate.`` You said in your edit summary that ``charge flows, current is a flow``, implying that current doesn't flow (verb). You are correct, so it's a good thing the original version didn't claim that ``current flows``! Also, the phrase ``charge is made to flow into the capacitor`` isn't really accurate, as overall, charge neither leaves nor enters the capacitor. Regards, ` False True article 1.5637e+08 random train 0.0 2007.0
73646 Water is densest at 4 degrees Celsius, but it’s not that much denser than water at other temperatures. Exploiting tiny differences in the density of water of different temperatures would be a very inefficient way of producing power. The only ways I know of that produce significant amounts of power from cold water are hydroelectric power (dams), tidal power, wave power, and power from ocean currents (analogous to wind power). Each of these involves falling water or water that is moving somewhat rapidly. False True article 2.46234e+08 random train 0.0 2008.0
15926 :Hi Deskana. I have replied to Jitse on his talk page; take a look. × False True user 5.30151e+07 random train 0.0 2006.0
95860 == David Carradine == You should realize what is vandalism and what isn't. Accusing me of vandalism is not constructive. False True user 3.40808e+08 random train 0.1 2010.0
80752 end quoted material. False True article 2.751e+08 random train 0.0 2009.0
32758 In your recent edit, you say that you are a real band, but are new to Wikipedia. Allow me to point out a few areas where you could improve. *All information in an article must be true, and verified by outside sources. I seriously doubt that this band sold 999,727,163 of anything, but you are welcome to add a link to a Rolling Stone article about this astounding achievement. *An article shouldn't be created by someone directly involved in the subject- for example, a band member shouldn't create an article about her own band. - False True user 1.00875e+08 random train 0.0 2007.0
106568 :::To me it has only two: a nation and open fields with grazing cows and sheep instead of smog, crowded streets and cars. False True user 3.91449e+08 random train 0.2 2010.0
8667 `Biblical allusions== I was thinking of putting this into the liteary allusions sections, but I'd like some comments and a quote check first. ``The film makes a number of seemingly biblical references, intrestingly it is mostly the Sith who make them. These possible references include: ``It is done`` ``I do not know you anymore`` ``If you are not with us, you are against us`` ( Though the last one might be more of a political reference to George Bushes use of the same quote).`` == ` False False article 3.38121e+07 random train 0.1 2006.0
16520 ` Fine, gentlemen, but as it appears it's only ourselves having this ``discussion``, allow me to submit this for your inspection, as sent by Turner May 19 to all media, the TPS and its Board (blocked by the latter two) and yes, Fantino himself, in response to the latest headlines. If you ever change your minds... Original to Leo Kinahan, counsel for Sgt. Jim Cassells Mr. Kinahan: Having discovered how sensitive you are about your e-mail address, note that I have included you as a BCC to protect your privacy. As I have been battling with the media ever since my saga began, I am only including them, at this time, in the hope that - considering the recent brave actions of your client and the resultant headlines which have followed - they may, finally, be forced to recognize my plight and resultant complaint since it all began. As briefly as I can describe it, I was set up by 55 Division in November of ' 98 with a manufactured charge of Criminal Harassment as a favour to their ... False False article 5.4644e+07 random train 0.0 2006.0
68912 ==Proposed deletion== Delete - body of work doesn't justify any indication of notability plus Kyle doesn't want it here. False True article 2.27893e+08 random train 0.0 2008.0
103237 As I see the same quote is also present in this book. False True article 3.75422e+08 random train 0.0 2010.0
125131 ` == Romney (neologism) == Is there reason to believe that this coined definition of ``Romney`` is anything more than a bit of non-notable WP:NOTNEWS? There is one article linked to CNN, which is of course a reliable source, but that doesn't mean that this is necessarily a definition which can be considered notable. I think there might be WP:BLP issues as well, though I am not as well versed as I should be in BLP policies so someone can correct me if I'm wrong...actually I think I'll ask because he knows his BLP policy well. In the Santorum case, Dan Savage was at least a notable person to begin with, but who is Jack Shepler and why is anything he says notable? :Looked into the issue and would have to agree with the points you make - I'm not seeing that the attempt at creating a neologism ended up being at all notable. ::Yes, no long term notability - this whole article has partisan attack issues. Its incredibly bloated and wants stripping to a couple of sentences and ... False True article 4.90454e+08 random train 0.0 2012.0
66934 ` ::``Troll``: I'm sure, but what has happened and will continue to happen if is exactly why this kite is not yet ready to fly. You must have concensus, and it does not yet exist! ` False False article 2.2086e+08 random train 0.1 2008.0
32852 == What about rsync.net? == Anyone mind if rsync.net is added to the Managed backup service providers list? I was going to add it but I didn't know the wiki naming convention: RsyncDotNet or Rsync_Net_(website) or ...? False True article 1.01258e+08 random train 0.0 2007.0
132886 *And just to be clear, an interview with an anonymous person claiming to be an Alawite during a time of conflict isn't exactly reliable and scholarly, when the purpose of said interview is demonisation of a religious group. False True article 5.36603e+08 random train 0.0 2013.0

48395 rows × 9 columns

